the benzyloxycarbonyl group. Digestion of a 30-residue peptide beginning
Gly-Pro-HyPro- with this new enzyme yielded Gly-Pro only, an indication
of its specificity and ability to act on longer polypeptides. Presumably,
a similar enzyme for use in the DCP system, when necessary, should
also become available at some future date.

In summary, the dipeptidyl peptidase approach to polypeptide se- 
quence analysis is a useful alternative to other methodologies for these 
determinations. With assets of speed and sensitivity, as well as the ability 
to proceed from either the NH2 or COOH terminus, it should be the 
method of choice in a number of situations. 



[47] Establishing Homologies in Protein Sequences 

By Margaret O. Dayhoff, Winona C. Barker, and Lois T. Hunt 

That different species contain homologous proteins was known long 
before the exact chemical sequences had been elucidated. Although it was 
clear that the homologous structures were not identical, nevertheless, 
mixed systems that functioned perfectly well chemically could be con- 
structed with enzymes from different species. The results from protein 
sequence determinations over the last 30 years have made clear the nature 
of the homologous structures. There is an ongoing process of mutation 
and selection whereby a normal protein of a species can change from one 
form to a slightly altered form. The accepted mutations are of two princi- 
pal kinds: point mutations, including alteration of one nucleotide of the 
triplet coding for one amino acid and deletions or insertions of one or a 
few whole codons; and large changes in the amount of genetic material, 
believed to be caused by unequal crossing-over of the chromosomes, 
resulting in duplications or deletions that can include entire genes. When 
gene pools become isolated, through either a separation of interbreeding 
populations or a duplication of genetic material within a species, the 
copies gradually acquire changes independently of one another. At first 
the sequences are so similar that there is no question about their common 
origin. With increasing time more and more change occurs until it may no 
longer be possible to recognize the similarity. 

Frequently the biochemist is confronted with the problem of identify- 
ing a newly determined protein sequence or a protein sequence inferred 
from a nucleotide sequence. If proteins are less than 30% different from 
each other, then similarity can often be detected immunologically. DNA 
coding regions can be identified by annealing if the nucleotide sequences
are less than 30% different. The identification of relationships between
proteins that are up to 75 or 80% different can be accomplished by
comparison of the sequences.

Copyright © 1983 by Academic Press, Inc.

In this chapter we will be particularly concerned with statistical tests 
capable of illuminating even very distant relationships. These tests are 
based on the hypothesis that the sequence under consideration and an- 
other selected sequence are more similar than one would expect by 
chance. Sequences can be selected for consideration by many criteria. 
Frequently they are chosen because they are similar in some aspects, for 
example, chemical function, active site, prosthetic group, unusual modifi- 
cation of a residue, tertiary structure, interaction with other molecules, 
amino acid composition, immunological similarity, physiological activity, 
or because of strong similarity of short sequence segments. They can also 
be selected by examining sequences for known active sites, by comparing 
parts of sequences to tabulations of sequences or to an alphabetized listing 
of the sequences of known segments, or by performing a computer search 
of a segment against all the known sequences. 

Two types of statistical tests are in common use. In one, all the
segments of a given length in one sequence are compared with all the
segments in the second sequence.¹⁻³ In the other, the best alignment of two
sequences is made.¹,⁴⁻⁸ We have based our computer programs RELATE
and ALIGN on these two methods. In both methods, a scoring matrix is 
required and a numerical property of the comparison is calculated. This 
same property is also calculated for a large number of pairs of permuted 
sequences (with the same compositions as the real sequences). The mean 
and standard deviation of the property are estimated from the distribution 
of scores of the permuted sequences. An assessment of the probability of 
the real score occurring by chance can then be made on the basis of the 
probabilities of standardized scores for the normal distribution (Fig. 1). 
These methods focus on the pattern in the sequence and do not include 
any contribution from similarity in the amino acid composition of the 
proteins. In sequences of nearly average composition, the contribution of 

1 M. O. Dayhoff, in "Atlas of Protein Sequence and Structure" (M. O. Dayhoff, ed.), Vol. 5,
Suppl. 3, pp. 1-8. National Biomedical Research Foundation, Washington, D.C., 1979.

2 W. M. Fitch, J. Mol. Biol. 16, 9 (1966).

3 W. M. Fitch, J. Mol. Biol. 49, 1 (1970).

4 W. C. Barker and M. O. Dayhoff, in "Atlas of Protein Sequence and Structure 1972"
(M. O. Dayhoff, ed.), Vol. 5, pp. 101-110. National Biomedical Research Foundation,
Washington, D.C., 1972.

5 S. B. Needleman and C. D. Wunsch, J. Mol. Biol. 48, 443 (1970).

6 P. H. Sellers, SIAM J. Appl. Math. 26, 787 (1974).

7 P. H. Sellers, Proc. Natl. Acad. Sci. U.S.A. 76, 3041 (1979).

8 T. F. Smith, M. S. Waterman, and W. M. Fitch, J. Mol. Evol. 18, 38 (1981).
Fig. 1. Probabilities of standardized scores for the normal distribution. This figure was
taken, with permission, from Table 35 in the "Atlas of Protein Sequence and Structure,"
Vol. 5, Suppl. 3, p. 374, 1979.

composition is weak. For sequences of uncommon composition, the
correct null hypothesis is not well understood. For example, if polylysine
were found in two organisms, would this represent common ancestry or
would it derive from poly(A) being incorporated into genes in two entirely
separate events? Methods for inferring relationships from amino acid
compositions have been studied⁹⁻¹⁴; we will not consider them here. In

9 J. J. Marchalonis and J. K. Weltman, Comp. Biochem. Physiol. B 38B, 609 (1971).
10 C. E. Harris and D. C. Teller, J. Theor. Biol. 38, 347 (1973).
11 A. Cornish-Bowden and A. Marson, J. Mol. Evol. 10, 231 (1977).
12 A. Cornish-Bowden, Biochem. J. 191, 349 (1980).
13 H. M. Shapiro, Biochim. Biophys. Acta 236, 725 (1971).
14 H. Vogel, J. Mol. Evol. 6, 271 (1975).






any case, the overall statistical probability would be derived by multiply- 
ing that from composition information and that derived from sequence 
pattern, since these properties are independent. 

Selection Using Printed Tabulations 

In scanning known sequences by eye, one usually concentrates on 
regions containing residues that are most highly conserved: Cys, Trp, and 
the two residues Tyr and Phe (which substitute principally for one an- 
other). Where these residues match, the rest of the residues should be 
examined. If the relationship is so strong that 40% of the residues are 
identical over 25 residues or more, without introducing breaks or gaps in 
either sequence, the relationship is quite definite and statistical tests are 
not necessary. Sequences more than 50% different usually require one or 
more breaks to align them for maximum homology. One usually must 
resort to statistical tests to evaluate the more distant relationships; unre- 
lated sequences are only 72 ± 6% different, excluding positions with a 
gap, when an unlimited number of gaps are permitted and the penalty for 
making a break in either sequence is equal to the score for one match. 

The "Protein Segment Dictionary," a key-amino-acid-in-context listing
of all 15-residue segments from the sequences known in 1977,
alphabetized on the sixth and following amino acids, can be referenced for
exact matches to short sequences of three or more amino acids (Fig. 2).¹⁵
Because fewer than 2% of all possible sequences of five amino acids and
0.2% of those sequences six long actually occur in this collection of
sequenced proteins, a search for a specific penta- or hexapeptide will usually
turn up the source protein and its close relatives or nothing at all.

    Protein   Segment
    Code      No.       Key

    TTBOB     106       KTNYC TKPQKSYM*
    TUHU        1           + TKPR*          Phagocytosis-stimulating peptide
    G4HUVN    125       VHNAK TKPREEQFBS  \
    G1HUEU    289       VHNAK TKPREQQYBS   |  Ig gamma chains
    G2GP      170       VGNAE TKPRVEQYBT  /
    CBB05      88       DRSKI TKPSES*

Fig. 2. A portion of the "Protein Segment Dictionary."¹⁵ The sequence of the
phagocytosis-stimulating peptide is completely contained in the sequences of the
immunoglobulin γ chains and nowhere else in the collection. This suggests that the
peptide may be derived chemically from the γ chain, particularly since it is known to
function in association with the γ chain.

15 M. O. Dayhoff, L. T. Hunt, W. C. Barker, R. M. Schwartz, and B. C. Orcutt, "Protein
Segment Dictionary 78." National Biomedical Research Foundation, Washington, D.C.,
1978.

The new sequence should be examined for known active sites. These 
have often been conserved over large phylogenetic distances. For exam- 
ple, the sequence Gly-Asp-Ser-Gly-Gly surrounding the serine active site 
of trypsin is absolutely conserved in all 19 of the sequenced serine pro- 
teases related to trypsin, including four bacterial sequences, and it occurs 
nowhere else in our present protein sequence database. 

Identification of Very Similar Sequences 

In identifying short segments that are identical with or very similar to a 
test piece, as in the problem of identifying the source of a peptide believed 
to originate by chemical degradation of a larger protein, a search using the 
unitary (or identity) matrix is appropriate. Alternatively, the sequence can 
be looked for in the "Protein Segment Dictionary."¹⁵

For a segment identification to be possible, the residues that have been 
sequenced need not be contiguous and the amino acids do not need to be 
unambiguously identified. However, such searches are usually practical 
only with a computer. An exact match of from six to nine amino acid 
residues (depending on the frequency of occurrence of these amino acids) 
should suffice to identify uniquely a segment among all human
sequences.¹⁶
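
A search of this kind can be sketched in a few lines. The following is a minimal illustration, assuming the unitary matrix (1 for an identity, 0 otherwise) and a hypothetical convention in which the letter X marks a position that was not determined; the function name and the short test fragment are illustrative and are not taken from the authors' programs.

```python
def identity_hits(probe, sequence, min_matches, unknown="X"):
    """Slide `probe` along `sequence`, scoring 1 per identical residue.

    Probe positions holding `unknown` are skipped, so the residues that
    were sequenced need not be contiguous. Returns (offset, matches) for
    every window with at least `min_matches` identities.
    """
    hits = []
    for start in range(len(sequence) - len(probe) + 1):
        window = sequence[start:start + len(probe)]
        matches = sum(1 for p, s in zip(probe, window) if p != unknown and p == s)
        if matches >= min_matches:
            hits.append((start, matches))
    return hits


# Illustrative fragment only; offsets are 0-based.
fragment = "VHNAKTKPREEQ"
print(identity_hits("TKPR", fragment, min_matches=4))  # fully determined probe
print(identity_hits("TXPR", fragment, min_matches=3))  # one undetermined residue
```

Lowering `min_matches` or adding undetermined positions increases the number of chance hits, which is why a probe of six to nine determined residues is suggested above for a unique identification.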

Computer Methods: The Database 

When relationships are not immediately obvious one can use computer 
methods to compare the new sequence to the other known sequences, 
which are conveniently accessible online in the Protein Sequence 
Database.¹⁷ In this database, we have organized the known sequences
into hierarchical groups of superfamilies, families, subfamilies, and
entries.¹⁸ The number in each group, the criteria for clustering, and the
method of identification of the hierarchical levels are shown in Table I. The
sequences are clustered into superfamilies of sequences that can be shown 
to be significantly related by statistical methods. Within the superfamily, 

16 M. O. Dayhoff and B. C. Orcutt, Proc. Natl. Acad. Sci. U.S.A. 76, 2170 (1979).

17 Protein Sequence Database (M. O. Dayhoff, L. T. Hunt, W. C. Barker, B. C. Orcutt, L.-S.
Yeh, H. R. Chen, D. G. George, M. C. Blomquist, J. A. Fredrickson, and G. C. Johnson).
National Biomedical Research Foundation, Washington, D.C., 1981.

18 M. O. Dayhoff, W. C. Barker, L. T. Hunt, and R. M. Schwartz, in "Atlas of Protein
Sequence and Structure" (M. O. Dayhoff, ed.), Vol. 5, Suppl. 3, pp. 9-24. National
Biomedical Research Foundation, Washington, D.C., 1979.
TABLE I

SUPERFAMILY ORGANIZATION

                                                    Number of residues
    Number in                  Criteria for         in first sequences
    group      Group           clustering           of each group

     512       Superfamilies   Probability of             98,394
                               similarity by
                               chance < 10⁻⁶
     774       Families        <50% different            142,045
    1161       Subfamilies     <20% different            192,097
    1667       Entries         <5% different             258,126

we have divided the sequences into families. Proteins within a family 
usually differ at fewer than half of their amino acid positions; their similar- 
ity of function has often been recognized before the sequences were 
known, and they have identical or very similar names. The sequences 
have been further divided into subfamilies. Sequences within a subfamily 
usually differ from each other at fewer than 20% of their amino acid
positions. Within a subfamily, sequences that differ by less than 5% are 
usually described in a single database entry. If the rate of change in amino 
acid sequences were exactly proportional to time, such a clustering would 
identify the twigs, branches, and boughs of the evolutionary tree. The rate 
has been only approximately constant, and so, in a clustering procedure 
such as this, there will always be cases that are borderline, some pairs 
within a group being below the cutoff and some above. In spite of difficul- 
ties in a few details, Table I gives a very good impression of the sequence 
information that is available. At present the number of newly investigated 
superfamilies is doubling in about 3 years and the number of newly se- 
quenced residues is doubling in about 4 years. 

The clustering procedure also gives a basis for selecting sequences for 
a given purpose. For a minimal database for computer searches, it would 
be adequate to select one sequence from each family and thereby reduce 
the costs of searching by a factor of 2. 

We postulate that there may be only 1000 superfamilies of functional 
proteins, some containing several groups of very distantly related
proteins. The probability, 10⁻⁶, for clustering sequences into a superfamily
has been chosen so that, in making all possible comparisons of 2000
sequences, each representing one group of clearly related sequences
(4 × 10⁶ comparisons), the number of groups placed into the wrong
superfamily will be very small, approximately 4. If one used a probability
of 10⁻⁵ for significance, one would expect about 40 misclassifications, or
an error in 4% of the groups. As more information becomes known on the
characteristics of the superfamilies, such misclassified sequences will be evident.
Also, as more sequences are known for any one superfamily, the pattern 
of conserved residues will become clearer. Incorporating this information 
into the statistical tests, through weighting of the conserved residues, will 
increase the detectability of related sequences. 

Our superfamily definition is based on sequence information so that a 
relationship can be readily verified by objective methods. Some of the 
superfamilies may have had a very distant common evolutionary origin, 
which may be demonstrated on the basis of other evidence, such as the 
positions of α-carbon atoms from X-ray crystallography.¹⁹,²⁰ Many of the
problems involved in the statistical detection of very distantly related
sequences have recently been discussed by Doolittle.²¹

Searching the Database 

In investigating a new sequence of unknown relationship to form hy- 
potheses of evolutionary similarity to other sequences in the database, we 
first use the computer program SEARCH.¹ We select several test
segments of, for example, 25-residue length from different regions of the new
sequence. Each of these is compared with all 25-residue segments and the 
shorter end segments of each known sequence. A score for the segments 
is accumulated by adding the pair scores of each amino acid in the seg- 
ment searched with the corresponding amino acid in each segment of the 
database. The pair scores are contained in a matrix. Several scoring sys- 
tems, summarized in matrices, are in common use (see below). The 
simplest system assigns a score of 1 for identities and 0 for nonidentities. 
Table II shows the average numbers of identical residues found in unre- 
lated segments in a search of the database. Segments were 25 residues in 
length, and 218,000 segments were compared in each of 15 searches. From 
these numbers we derived the number of high-scoring segment pairs to be 
expected in a RELATE run and the probability of finding a pair of seg- 
ments with at least a given number of identities. For sequences that are 
related, the similarity extends beyond the segment searched, whereas for 
unrelated sequences it usually does not. 
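
The segment comparison just described can be sketched as follows. This is a minimal illustration rather than the SEARCH program itself: it assumes the simplest scoring system (1 for identities, 0 for nonidentities), ignores the shorter end segments, and uses a made-up two-entry database; all function and entry names are hypothetical.

```python
def segment_scores(probe, sequence, pair_score):
    """Score `probe` against every full-length segment of `sequence`.

    A segment score is the sum of the pair scores of the aligned residues;
    no gaps are permitted. Shorter end segments are ignored in this sketch.
    """
    n = len(probe)
    return [
        (start, sum(pair_score(a, b) for a, b in zip(probe, sequence[start:start + n])))
        for start in range(len(sequence) - n + 1)
    ]


def unitary(a, b):
    """Simplest scoring system: 1 for an identity, 0 for a nonidentity."""
    return 1 if a == b else 0


def best_segment(probe, database, pair_score=unitary):
    """Return (score, entry name, offset) of the top-scoring segment."""
    best = None
    for name, sequence in database.items():
        for start, score in segment_scores(probe, sequence, pair_score):
            if best is None or score > best[0]:
                best = (score, name, start)
    return best


# Hypothetical database; the sequences are invented for illustration.
toy_database = {
    "entryA": "MKTAYIAKQRQISFVKSHFSRQ",
    "entryB": "GGDSGGPLVCNGQLQGVVSWG",
}
print(best_segment("DSGGPLV", toy_database))
```

Replacing `unitary` with a lookup into another scoring matrix changes the scoring system without changing the search loop.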

For detecting distant relationships, we have found the mutation data 
matrix to be best²² (see Fig. 3). The distribution of scores from unrelated

19 P. Keim, R. L. Heinrikson, and W. M. Fitch, J. Mol. Biol. 151, 179 (1981).

20 S. J. Remington and B. W. Matthews, Proc. Natl. Acad. Sci. U.S.A. 75, 2180 (1978).

21 R. F. Doolittle, Science 214, 149 (1981).

22 R. M. Schwartz and M. O. Dayhoff, in "Atlas of Protein Sequence and Structure" (M. O.
Dayhoff, ed.), Vol. 5, Suppl. 3, pp. 353-358. National Biomedical Research Foundation,
Washington, D.C., 1979.
TABLE II

Identities in Unrelated 25-Residue Segments (No Gaps Permitted) 
Fig. 3. The mutation data matrix, or log odds matrix for 250 PAMs (see text). Elements
are shown multiplied by 10. The neutral score is 0. A score of -10 means that the pair would
be expected to occur only one-tenth as frequently in related sequences as chance would
predict, and a score of +2 means that the pair would be expected to occur 1.6 times as
frequently. The order of the amino acids has been arranged to illustrate the patterns in the
mutation data. This figure was taken, with permission, from Fig. 84 in the "Atlas of Protein
Sequence and Structure," Vol. 5, Suppl. 3, p. 352, 1979.
[Figure 4: histogram of segment scores. Legend: same family; same superfamily;
unrelated. Axis: number of segments.]

Fig. 4. Histogram of scores for a search of bovine trypsinogen. The 25-residue segment
used, positions 34 to 58, contains the histidine active site. Two distantly related eukaryote
sequences, factor X and haptoglobin β chain, did not get high scores in this search.



segments is approximately normal; related segments appear in an abnor- 
mally long tail of high scores. Typically, for a 25-residue segment, all 
corresponding sequences of the same family (<50% different) appear 
above the distribution of scores of unrelated segments (see Fig. 4). About
half of the more distantly related sequences in the same superfamily are
also above this distribution, whereas the rest may be within the upper tail
of scores from unrelated segments. If the initial searches produce no
scores above 50, probably there are no other sequences from the same
family in the data collection. Searches using a very short segment, say 10 
residues, produce a distribution of unrelated scores having a very long tail 
of high scores that often obscures the real relationships. For longer seg- 
ments, say 40 residues, there are usually insertions or deletions in the 
sequences that interfere with obtaining high scores. 
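
The legend of Fig. 3 implies that each element of the mutation data matrix is ten times the base-10 logarithm of an odds ratio. A minimal sketch of the conversion, assuming exactly that relationship (the function names are hypothetical, and rounding to an integer is an assumption about how printed values were obtained):

```python
import math


def odds_from_element(element):
    """Convert a matrix element (10 x log10 of the odds) to an odds ratio."""
    return 10 ** (element / 10)


def element_from_odds(odds):
    """Inverse conversion, rounded to the nearest integer."""
    return round(10 * math.log10(odds))


print(odds_from_element(-10))           # pair seen one-tenth as often as chance
print(round(odds_from_element(2), 1))   # pair seen about 1.6 times as often
```

Because the elements are logarithms, adding them along a segment corresponds to multiplying the odds for the individual residue pairs.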

In cases where no obvious relationship is found in a computer search, 
one may compare the top-scoring 30-40 segments with the test segment 
using several other criteria. For example: Are identical residues clustered 
(4-6 together)? Would the introduction of a single gap greatly improve the
match? Does a matching segment occur in the same part of the molecule 
as the test segment, or is there the possibility of internal duplications if the 
regions are different? Do the test protein and the matching protein have
similar secondary or tertiary structures or similar functions or functional 
domains? Do test segment and matching segment share identities that bind 
the same functional moieties? Are both from the same organelle, cell type, 
or major group of organisms? Positive answers to such questions can 
increase the possibility of a relationship and indicate candidates for 
ALIGN or RELATE analyses. 

If the initial examination does not identify a relationship, we may 
search more exhaustively. Typically we search each successive segment 
of 25 residues. This provides for detection even in the case where only a 
fragment of a related sequence is present in the database. We will consider 
further two kinds of screening procedures. In the first, especially suitable 
for evaluation of search results by eye, all sequences that get a high score, 
>40, on at least one search are selected for statistical evaluation with 
programs ALIGN and RELATE. In the second, sequences containing a 
moderately high score, ≥30, on two or more searches are selected.

In any screening process, one must balance the probability and value 
of success against the cost of examining spurious suggestions. We conjec- 
ture that, at present, between 25 and 50% of all of the superfamilies to be 
found in organisms are represented in the database. If so, in more than 
half of the cases no related sequence will be found even with perfect 
detection methods. Even where there are very distant relationships that 
could be confirmed by statistical tests on the complete sequences, it may 
be very expensive or even impossible to find them by searching for short 
segments. Success depends on the related sequence containing enough 
regions giving high scores that these will be observed in the searches 
performed.

For this chapter we searched a total of 101 segments from 13
sequences: bovine trypsinogen,²³ Chromatium vinosum high-potential
iron-sulfur protein,²⁴,²⁵ Escherichia coli thioredoxin,²⁶ Streptomyces erythreus
ribonuclease,²⁷ horse alcohol dehydrogenase,²⁸ Desulfovibrio vulgaris
flavodoxin,²⁹ E. coli K12 dihydrofolate reductase type I,³⁰ human
antithrombin-III,³¹ human α₁-microglobulin,³² E. coli 50S ribosomal
protein L7/L12,³³,³⁴ bovine α-crystallin A chain,³⁵ human prolactin,³⁶ and
human thyrotropin α chain.³⁷,³⁸ For each search, the highest score
received by an unrelated segment was tallied. Figure 5 shows the percentage
of searches containing a score as high or higher than a given value. The
distribution of highest scores is approximately normal, with a mean of
41.5 and a standard deviation of 4.1. On this basis, the probability of the
top score being above 60 is <3.4 × 10⁻⁶ and above 70 is 1.3 × 10⁻¹² in
searches of the approximately 220,000 unrelated 25-residue segments in
the current database. Such high scores are not independent of amino acid
composition; they are found more often for segments that contain several
Tyr, Phe, Trp, or Cys residues.

[Figure 5: cumulative curve. Horizontal axis: Score Z, 30 to 60. Vertical axis:
percentage of searches.]

Fig. 5. Percentage of searches with a top score ≥Z. This curve is for a database of
250,000 residues. As the database increases in size, the curve will shift to the right. One
hundred and one segments of 25 residues, drawn from 13 sequences, were searched using the
mutation data matrix.

23 O. Mikes, V. Holeysovsky, V. Tomasek, and F. Sorm, Biochem. Biophys. Res. Commun.
24, 346 (1966).

24 S. M. Tedro, T. E. Meyer, R. G. Bartsch, and M. D. Kamen, J. Biol. Chem. 256, 731
(1981).

25 K. Dus, S. Tedro, and R. G. Bartsch, J. Biol. Chem. 248, 7318 (1973).

26 A. Holmgren, Eur. J. Biochem. 6, 475 (1968).

27 N. Yoshida, A. Sasaki, M. A. Rashid, and H. Otsuka, FEBS Lett. 64, 122 (1976).

28 H. Jörnvall, Eur. J. Biochem. 16, 41 (1970).

29 M. Dubourdieu and J. L. Fox, J. Biol. Chem. 252, 1453 (1974).

30 D. R. Smith and J. M. Calvo, Nucl. Acids Res. 8, 2255 (1980).

31 T. E. Petersen, G. Dudek-Wojciechowska, L. Sottrup-Jensen, and S. Magnusson, in "The
Physiological Inhibitors of Blood Coagulation and Fibrinolysis" (D. Collen, B. Wiman,

In comparing one single segment with another, the probability of
obtaining a score above 30 (P = 1.3 × 10⁻⁴) or above 40 (P = 5.9 × 10⁻⁶)
can be estimated from the number of such scores obtained in the searches. 
There were 513 scores above 30 and 24 scores above 40 obtained from 
sequences not known to be related, in comparing almost 4 million seg- 
ments in 18 searches. Typically, for the present database, in searching ten 
25-residue segments, about 13 sequences will be identified with scores 
greater than 40, 68 with scores over 35, and perhaps 285 with scores over 
30. Sequences with the highest scores in one or more of the searches can 
be examined by eye for more extensive similarity. With a simple computer 
program one can locate sequences with two or more fairly high-scoring 
segments in approximate register. Very few such sequences are found in 
examining all scores of 30 or higher. 
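
The arithmetic behind these estimates can be checked with a short calculation. This sketch assumes that each of the 18 searches covered the same 218,000 segments quoted earlier for the identity searches; that figure is an assumption made here, so the probability for scores above 40 comes out slightly different from the value given in the text.

```python
SEGMENTS_PER_SEARCH = 218_000                # assumed database size in segments
COMPARISONS = 18 * SEGMENTS_PER_SEARCH       # "almost 4 million" comparisons


def chance_probability(count, comparisons=COMPARISONS):
    """Observed fraction of unrelated comparisons reaching a given score."""
    return count / comparisons


def expected_hits(probability, searches, segments=SEGMENTS_PER_SEARCH):
    """Expected number of chance scores in `searches` segment searches."""
    return probability * searches * segments


p30 = chance_probability(513)   # scores above 30
p40 = chance_probability(24)    # scores above 40
print(f"P(>30) = {p30:.1e}, P(>40) = {p40:.1e}")
print(round(expected_hits(p30, 10)), round(expected_hits(p40, 10)))
```

Under this assumption the expected counts for ten searches agree with the 285 and 13 quoted above.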

The possibility of finding a related sequence by getting a high 
SEARCH score was also examined. In comparing all segments from the 
sequence searched with all those from the related sequence, the number 
of scores, S, greater than or equal to a certain value, Z, was counted. The
number of segments of the first sequence that could be chosen for a search
is the length, L, minus 24. The probability, P(Z), of finding a high score in
one search of an arbitrarily chosen segment is then given by
P(Z) = S/(L - 24). The decision as to how many searches to make can be
reduced to a sampling problem based on this probability.

For the 13 sequences that were searched, there were 28 cases in which 
a sequence from a different family was present in the database and the
relationships could be detected using our computer programs ALIGN or
RELATE, described below. The 28 pairs of related sequences were
grouped by percentage difference. Table III shows the probabilities of a
segment giving scores equal to or greater than 40 and 30 for the related 
sequences that we examined. These probabilities are calculated from 
the average percentage of the segments that would give a sufficiently high 
SEARCH score. 

From these probabilities, one can readily calculate the probability of 
finding at least one segment with a score ≥40, or two or more segments
with a score ≥30, in performing a specified number of searches. These are
the probabilities of the success of the screening procedures when a related
sequence is available. In Table IV we show the values for finding one or
more scores, P₁, and two or more scores, P₂, in making N = 5 and N = 10
searches. If the screen involves selecting sequences with two scores ≥30
TABLE III

Probability of a High Search Score as a
Function of Sequence Difference

    % Differenceᵃ     P(30)ᵇ     P(40)

    50-55
    55-60             0.69       0.53
    60-65             0.59       0.37
    65-70             0.40       0.22
    70-75             0.47       0.27
    75-80             0.28       0.10

ᵃ Sequences were aligned using the mutation data matrix with a bias of 6 and a
gap penalty of 6. The percentage difference was formed by counting the number
of positions with identical residues divided by the total number of positions
containing residues in both sequences. Positions with a gap in either sequence
were ignored.

ᵇ P(30) is the probability that a score of 30 or greater will be obtained for the
homologous segment on a single search.



TABLE IV
Probabilityᵃ of Success in the Search
Screening Procedure

               P₁ (one or more)       P₂ (two or more)
     p         N = 5     N = 10       N = 5     N = 10

    0.1        0.41      0.65         0.08      0.26
    0.2        0.67      0.89         0.26      0.63
    0.3        0.83      0.97         0.47      0.85
    0.4        0.92      0.99         0.66      0.95
    0.5        0.97      1.00         0.81      0.99
    0.6        0.99      1.00         0.91      1.00
    0.7        1.00      1.00         0.97      1.00
    0.8        1.00      1.00         0.99      1.00
    0.9        1.00      1.00         1.00      1.00
    1.0        1.00      1.00         1.00      1.00

ᵃ p is the probability that the related segment will get a high score on a single
search. P₁ is the probability that a related segment will get a high score on one
or more of N searches. P₂ is the probability that the related segments will get
high scores on two or more of N searches.
in making 10 searches, then the probability of finding a related sequence
about 60-65% different [P(30) = 0.59] is 1.00. If it involves selecting
sequences with one score ≥40 [P(40) = 0.37], then the probability of
finding a related sequence is 0.98. The probability of missing a relation- 
ship is <2%. From Table IV it is readily seen that the probability of suc- 
cess of the screening procedure (when related segments are present) de- 
pends on the number of segments searched as well as on the probability of 
success in any one search. 
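
The entries of Table IV are consistent with treating the N searches as independent trials that each succeed with the single-search probability. The sketch below makes that binomial assumption explicit; the formulas are not stated in the text and the function names are hypothetical.

```python
def p_one_or_more(p, n):
    """Probability of at least one high score in n independent searches."""
    return 1 - (1 - p) ** n


def p_two_or_more(p, n):
    """Probability of at least two high scores in n independent searches."""
    return 1 - (1 - p) ** n - n * p * (1 - p) ** (n - 1)


for p in (0.1, 0.3, 0.59):
    print(p,
          round(p_one_or_more(p, 5), 2), round(p_one_or_more(p, 10), 2),
          round(p_two_or_more(p, 5), 2), round(p_two_or_more(p, 10), 2))
```

The same functions can be used to decide how many segments to search in order to reach a chosen probability of detecting a related sequence, if one is present.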

Alignment Scores: Program ALIGN 

If the results of the screening procedure using the SEARCH program 
suggest sequences that might be homologous, we proceed with simple ob- 
jective methods to find the probability that the similarity occurred by 
chance. Program ALIGN¹,⁴ is particularly suitable for sequences of
similar length with matching segments positioned comparably. For any given
alignment between two sequences a numerical value can be computed. 
The contribution of each match of a residue in one sequence with one in 
the other is defined by a matrix of values. A break may be introduced in 
either sequence, for which a gap penalty is incurred. The score for an 
amino acid matching a gap position is zero. The score for an alignment is 
the sum of scores for the positions along the alignment and the gap
penalties incurred. Considering all possible alignments of the residues and any
number of gaps, the basic algorithm of program ALIGN determines the 
maximum score possible and produces an alignment with that score.¹,⁴⁻⁸

The scoring matrix is constructed from an input matrix, typically the 
mutation data matrix, and a matrix bias parameter, B, that is added to all 
terms of the input matrix. The net effect of adding B is that the score for
any given alignment is increased by B times the number of positions where
a residue is aligned with another residue. If B is increased, the alignment
with the maximum score frequently changes to one with more residues 
aligned, and it therefore has a shorter overall length. 

The maximum score, S, that can be achieved by an alignment of a pair
of real sequences is compared with the distribution of maximum scores for
a large number (usually 100) of random permutations of the two
sequences. The mean and standard deviation of this approximately normal
distribution are Sᵣ and SDᵣ. The alignment score, A, is the number of
standard deviations by which the maximum score for the real sequences
exceeds the average maximum score for the random permutations:
A = (S - Sᵣ)/SDᵣ (in SD units).
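
The calculation of the alignment score can be illustrated with a small sketch. It is not the ALIGN program: it assumes identity scoring in place of the mutation data matrix and bias, and it simplifies the gap treatment by charging the penalty for every gap position rather than once per break. All names are hypothetical.

```python
import math
import random
import statistics


def identity(x, y):
    """Stand-in scoring matrix: 1 for an identity, 0 otherwise."""
    return 1 if x == y else 0


def max_alignment_score(a, b, pair_score, gap_penalty):
    """Maximum global alignment score by dynamic programming.

    Simplification: every gap position costs `gap_penalty`.
    """
    prev = [-gap_penalty * j for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        curr = [-gap_penalty * i]
        for j, y in enumerate(b, start=1):
            curr.append(max(
                prev[j - 1] + pair_score(x, y),  # x aligned with y
                prev[j] - gap_penalty,           # x opposite a gap
                curr[j - 1] - gap_penalty,       # y opposite a gap
            ))
        prev = curr
    return prev[-1]


def permuted(seq, rng):
    """Random permutation of a sequence, preserving its composition."""
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)


def alignment_score_sd(a, b, pair_score, gap_penalty, permutations=100, seed=0):
    """A = (S - S_r) / SD_r, estimated from permuted sequence pairs."""
    rng = random.Random(seed)
    real = max_alignment_score(a, b, pair_score, gap_penalty)
    scores = [
        max_alignment_score(permuted(a, rng), permuted(b, rng), pair_score, gap_penalty)
        for _ in range(permutations)
    ]
    return (real - statistics.mean(scores)) / statistics.stdev(scores)


def upper_tail(a):
    """Probability that a standard normal deviate exceeds `a`."""
    return 0.5 * math.erfc(a / math.sqrt(2))


seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"   # invented test sequence
print(alignment_score_sd(seq, seq, identity, gap_penalty=1) > 3)
print(upper_tail(3.0))
```

Comparing a sequence with itself gives a score many standard deviations above the permuted pairs, as expected; for real comparisons the value of A is converted to a chance probability with the normal tail function.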

The alignment score is thus expressed in units of standard deviations 
from the mean of random scores. The statistic A will be normally
distributed with a mean of zero and a standard deviation of 1.0 if the
sequences being compared are unrelated, if the scores from the randomized
sequences form a normal distribution, and if a sufficient number, N, of
random scores is generated. The standard deviation of the mean is 1/√N
and the standard deviation of the standard deviation is 1/√(2N). We have
calculated alignment scores for randomized sequences and for real
sequences that are unrelated. A chi-square test on the six intervals between
-3 and +3 standard deviations from the mean indicates that, in this range,
these scores reasonably (x^ = 17) ^ normal distribution.^ 

The alignment score in SD units is not proportional to evolutionary 
distance, and its value is affected by the parameters of a given compari- 
son. However, it is very useful for making statements regarding the prob- 
ability that the similarity observed could have occurred by chance. The 
probability that a score as high as that from the real sequences could have 
been obtained in a comparison of randomized or unrelated sequences can 
be determined from the cumulative standardized normal distribution, as 
shown in Fig. 1. The results of a comparison of human α1-microglobulin 
and bovine lactoglobulin are shown in Fig. 6. The ALIGN program is 
our most sensitive tool for comparing sequences of similar length with no 
long internal duplications or rearrangements. Because the amount of time 
and core memory needed for the algorithm are approximately propor- 
tional to the product of the lengths of the two sequences, it can become 
impractical to use this program for very long sequences. 
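
As a worked example, the values printed in the ALIGN output for the α1-microglobulin and lactoglobulin comparison (total score 1064, mean random score 960.65, standard deviation 17.72) can be converted to an alignment score and a chance probability. The short Python sketch below assumes the random scores are normally distributed, as discussed above; the function name chance_probability is ours.

```python
import math


def chance_probability(score_sd):
    """Upper-tail probability of the standard normal distribution."""
    return 0.5 * math.erfc(score_sd / math.sqrt(2.0))


real_score, mean_random, sd_random = 1064.0, 960.65, 17.72
a = (real_score - mean_random) / sd_random  # alignment score in SD units
p = chance_probability(a)
print(round(a, 2), p)
```

The computed score agrees with the printed value of 5.83 SD, and the corresponding tail probability is of the order of 10⁻⁹.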

There are several situations in which one is tempted to select portions 
of two sequences for comparison with, for instance, program ALIGN. 

1. The longer sequence appears to consist of domains, one or more of 
which may be homologous with the shorter sequence, and one do- 
main is selected for the comparison. 

2. One sequence is much shorter and appears homologous with a part 
of a longer sequence, and, therefore, only some of the longer se- 
quence is used. 

3. A portion of one sequence appears to be homologous with a portion 
of the other sequence, and only a part of each sequence is selected 
for comparison. 

As a general rule, one must remember that one has biased the results if 
the selection of the portions to be compared is based entirely on sequence 
similarity, especially if the similarity was discovered by examining a large 
number of sequences. For instance, if our database is searched with a 
segment 50 residues long, the test segment is compared with nearly 
180,000 segments. Even if these segments are all unrelated, the highest 
scoring of these will give a probability of about 10⁻⁵, corresponding to a 
score of 4.3 SD, if compared with the test segment using ALIGN or 
RELATE. 

Mutation Data Matrix (250 PAMs) + 6 
32 Identities out of 150 possible matches between residues 
7 Breaks, penalty for a break = 6 

Total score (R) = 1064 
100 random runs 
Mean score (M) = 960.65 
R - M = 103.35 

Standard deviation = 17.72 
Alignment score = 5.83 

Fig. 6. Alignment of human α1-microglobulin and bovine lactoglobulin. These 
sequences are 77% different, and the alignment score is 5.8, near the limit 
of detection. A bias of 6 and a gap (break) penalty of 6 were used. For 
distantly related sequences, there are frequently several alignments that 
obtain the same best alignment score. The alignment score can be 
significantly high even when there are many equally good alignments. 

G. Braunitzer, R. Chen, B. Schrank, and A. Stangl, Hoppe-Seyler's Z. Physiol. Chem. 
354, 867 (1973). 

It is preferable, particularly when trying to demonstrate a very distant 
relationship, to choose the sequence portions to be compared by 
independent criteria such as domains that have been identified by X-ray 
crystallography, duplicated regions within a sequence, exons that have been 
identified by nucleotide sequencing, or fragments shown to have a 
particular activity. When choosing segments to compare, there are many more 
possibilities than when choosing whole sequences. The probability of 10⁻⁶ 
used for superfamily clustering is not appropriate; a lower value will be 
required, depending on the method of selection employed. 
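
The size of this selection effect can be estimated with a short calculation. The Python sketch below is a rough illustration under the assumption that scores of unrelated comparisons follow a standard normal distribution and that the best of n comparisons sits near the upper-tail probability 1/n; the function name is hypothetical.

```python
from statistics import NormalDist


def score_expected_by_chance(n_comparisons):
    """Score (in SD units) whose upper-tail probability is 1/n_comparisons.

    Roughly the best score one expects to see among n unrelated comparisons.
    """
    tail = 1.0 / n_comparisons
    return NormalDist().inv_cdf(1.0 - tail)


best_unrelated = score_expected_by_chance(180_000)
print(round(best_unrelated, 1))
```

For the 180,000 segments of the database search described above, this gives a value close to the score of about 4.3 SD quoted in the text, so a score of that size found by searching is not by itself evidence of relationship.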

Segment Comparison Scores: Program RELATE 

A second computer method for detecting unusual similarity between 
sequences compares all possible segments of a given length from one 
sequence with all segments of the same length from the second sequence. 
This method is particularly useful when it is not apparent 
which residues correspond; for example, if the sequences are of very 
different length, if the segments of one sequence are not colinear with 
related segments in the other, and if duplications are to be detected within 
a single sequence. Very long sequences can be compared, although the 
time required is proportional to the product of the lengths. A segment 
score is accumulated from the pair scores of the amino acids occupying 
corresponding positions within two segments. A scoring matrix is 
supplied by the user, as before. For example, if the length of the segments 
used is 25 and the total lengths of the two proteins compared are 100 and 
120 amino acids, then 7296 scores will be tabulated. At most, 76 of these 
will come from comparisons of corresponding segments. As with program 
ALIGN, numerical properties of the distribution of scores are determined 
for the real sequences and for a number of randomized sequences. A 
RELATE score (in SD units) is calculated as the difference between the 
value determined for the real sequences and the average value determined 
from the many pairs of randomized sequences, divided by the standard 
deviation of the values from the randomized sequences. The probability of 
occurrence of a particular RELATE score by chance can be found in Fig. 
1. We use two different kinds of numerical property. The first is the 
RELATE Magnitude, the average magnitude of a predetermined number 
of highest scores, usually chosen to be the number of scores to be ex- 
pected if the sequences are related (i.e., 76 in the above example). 
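
The segment comparison can be sketched in Python as follows. This is a simplified illustration, not the RELATE program: the function names are ours, and an identity pair score is used in place of a user-supplied scoring matrix. Conversion to SD units would proceed as for ALIGN, by repeating the calculation on randomized sequences.

```python
def segment_scores(seq_a, seq_b, seg_len, pair_score):
    """Score every segment of seq_a against every segment of seq_b."""
    scores = []
    for i in range(len(seq_a) - seg_len + 1):
        for j in range(len(seq_b) - seg_len + 1):
            s = sum(pair_score(seq_a[i + k], seq_b[j + k])
                    for k in range(seg_len))
            scores.append((s, i, j))  # score and the two start positions
    return scores


def relate_magnitude(scores, n_top):
    """Average of the n_top highest segment scores."""
    top = sorted((s for s, _, _ in scores), reverse=True)[:n_top]
    return sum(top) / len(top)


def identity(x, y):
    return 1 if x == y else 0


# Artificial sequences of 100 and 120 residues, segment length 25.
seq_a = "ACDEFGHIKL" * 10
seq_b = "ACDEFGHIKLMN" * 10
all_scores = segment_scores(seq_a, seq_b, 25, identity)
n_corresponding = min(len(seq_a), len(seq_b)) - 25 + 1
print(len(all_scores), n_corresponding)
```

The counts reproduce the example in the text: (100 - 25 + 1) x (120 - 25 + 1) = 7296 segment scores, of which at most 76 can come from corresponding segments.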

The program also calculates a second kind of numerical property, the 
RELATE Count score. A whole spectrum of Count scores, SC(N), in SD 
units, is determined for each computer run (Table V). For each of 
these, the numerical property is the number of scores at or above the Nth 
scoring interval. 

TABLE V 

Spectrum of Count Scores from Program RELATE (a) 

                             Cumulative values 

Lowest score   No.       Mean No.                  SD          Count score (b) 
of range       (real)    (random)    Difference    (random)    (SD units) 

 0             2885      2970.58      -85.58       223.35       -0.383 
10              544       532.19       11.81        64.23        0.184 
20              175        67.93      107.07        20.33        5.267 
30               86         5.59       80.41         5.34       15.050 
40               52         0.29       51.71         1.30       39.626 
50               14         0.03       13.97         0.22       62.730 
60                5         0.00        5.00         0.00 

(a) Human haptoglobin and bovine trypsinogen were compared. 

(b) The number of matches with scores ≥20 clearly exceeds that expected by chance. 

A. Kurosky, D. R. Barnett, T.-H. Lee, B. Touchstone, R. E. Hay, M. S. Arnott, B. H. 
Bowman, and W. M. Fitch, Proc. Natl. Acad. Sci. U.S.A. 77, 3388 (1980). 

One can examine the spectrum to find the maximum SC(N) and the 
corresponding number of segments with higher scores from the real 
sequences. For related sequences, sometimes there are only a few 
unexpectedly high-scoring segments, which are the result of preferential 
conservation of a small active site. Sometimes, as in the case of 
extensively conserved sequences, all of the corresponding segments get 
scores above the rest of the distribution. On occasion, the maximum value 
corresponds to a surplus of hundreds of scores above the expected score, 
but all well below the maximum score. This situation occurs when the 
sequences are very distantly related and when they both have evolved 
through many repetitions of similar segments. No single comparison may 
be outstanding, but the numerous intercomparisons of the repeated 
segments yield a bulge in the upper tail of the distribution. 
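
The Count scores of Table V can be recomputed from the tabulated columns. In the Python sketch below the data are the rounded values printed in the table, so the recomputed scores agree with the printed ones only approximately in the upper rows, where the random mean and standard deviation are small; the variable and function names are ours.

```python
# (lowest score of range, No. real, mean No. random, SD random) from Table V
TABLE_V = [
    (0, 2885, 2970.58, 223.35),
    (10, 544, 532.19, 64.23),
    (20, 175, 67.93, 20.33),
    (30, 86, 5.59, 5.34),
    (40, 52, 0.29, 1.30),
    (50, 14, 0.03, 0.22),
    (60, 5, 0.00, 0.00),
]


def count_score(n_real, mean_random, sd_random):
    """Surplus of real over random counts, in SD units (None if SD is zero)."""
    if sd_random == 0:
        return None
    return (n_real - mean_random) / sd_random


spectrum = [(low, count_score(n, m, sd)) for low, n, m, sd in TABLE_V]
defined = [(low, sc) for low, sc in spectrum if sc is not None]
best_low, best_sc = max(defined, key=lambda item: item[1])
print(best_low, round(best_sc, 1))
```

Scanning the spectrum in this way picks out the scoring interval at which the real sequences show the largest surplus over the randomized ones.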

Several other kinds of output are obtained from this program, includ- 
ing an ordered list of many segment comparisons giving the highest 
scores, from which the regions of the sequences showing unusual similar- 
ity can be identified. From this list, a plot of the segments that match best 
in the two sequences can be made by hand or by computer as shown 
in Fig. 7. Longer contiguous sequences that match appear as diagonal 
lines on this figure and are easily interpreted by the human eye. 

A. J. Gibbs and G. A. McIntyre, Eur. J. Biochem. 16, 1 (1970). 

J. V. Maizel, Jr., and R. P. Lenk, Proc. Natl. Acad. Sci. U.S.A. 78, 7665 (1981). 

W. C. Barker, L. K. Ketcham, and M. O. Dayhoff, in "Atlas of Protein Sequence and 
Structure" (M. O. Dayhoff, ed.), Vol. 5, Suppl. 3, pp. 359-362. National Biomedical 
Research Foundation, Washington, D.C., 1979. 

T. Hase, H. Matsubara, and M. C. W. Evans, J. Biochem. (Tokyo) 81, 1745 (1977). 

M. Tanaka, T. Nakashima, A. M. Benson, H. F. Mower, and K. T. Yasunobu, 
Biochemistry 5, 1666 (1966). 
Fig. 7. The highest scoring segment pairs of Chromatium vinosum and Clostridium 
pasteurianum ferredoxins determined by program RELATE. All segments eight residues 
long were compared using the mutation data matrix. Scores ≥22 are shown by solid boxes 
and those from 18 to 21, by open boxes. It is clear from the main diagonal line that the 
sequences match from one end to the other, but with an insertion in the Chromatium 
sequence between residues 37 and 43. The short diagonal lines above and below the main 
diagonal reflect duplicated sequences in the molecules. 

This program is particularly useful for the comparison of a sequence 
with itself (omitting the comparison of identical segments) to show the 
existence and location of single or multiple repeated segments. We have 
developed a screening procedure to detect major duplications, using the 
RELATE Magnitude property, and applied it to the families in the 
database. Some of the most pronounced duplications are shown in 
Table VI. The number of highest scores to be used was determined for 
each protein from the segment length, S, and the total length of the se- 
quence, L: L/2 - S + 1. This expression is equal to the number of scores to 
be expected from comparisons of corresponding segments if the sequence 
has exactly doubled. A protein that has duplicated will have many high 
scores for segments that are displaced by half of its total length. A protein 
with a prominent 10-residue periodicity will have many high scores at 
displacements of 10, 20, 30, etc. Such additional matches also appear on 
the plot, as shown in Fig. 7 for the ferredoxin molecule. 
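
The self-comparison can be illustrated with a small Python sketch. It is a simplified stand-in for the screening procedure: identities are counted instead of mutation data matrix scores, the sequence is artificial, and the function names are ours.

```python
from collections import Counter


def expected_corresponding(total_len, seg_len):
    """L/2 - S + 1: corresponding segment pairs in an exactly doubled sequence."""
    return total_len // 2 - seg_len + 1


def high_score_displacements(seq, seg_len, threshold):
    """Tally displacements (j - i) of segment pairs scoring at or above threshold.

    Identical segments (i == j) are omitted, as in the text.
    """
    counts = Counter()
    n_segments = len(seq) - seg_len + 1
    for i in range(n_segments):
        for j in range(i + 1, n_segments):
            score = sum(seq[i + k] == seq[j + k] for k in range(seg_len))
            if score >= threshold:
                counts[j - i] += 1
    return counts


doubled = "ACDEFGHIKL" * 2  # artificial sequence that has exactly doubled
hits = high_score_displacements(doubled, seg_len=5, threshold=5)
print(dict(hits), expected_corresponding(len(doubled), 5))
```

For this doubled sequence all high scores fall at a displacement of half the total length, and their number equals L/2 - S + 1.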

W. C. Barker, L. K. Ketcham, and M. O. Dayhoff, J. Mol. Evol. 10, 265 (1978). 

Although this procedure provides an easy and straightforward criterion 
for internal duplication, it may fail to detect duplications in certain 
cases. If a duplication involves only a small fraction of the sequence, it 
may not be detected unless it is composed of amino acids that are usually 
highly conserved in proteins. We have purposely designed the procedure 
to detect major duplication and extensive periodicity. If too many changes 
have occurred in the sequence, an ancestral duplication may not be 
detected. For detecting ancient duplications, we use the mutation data 
matrix and segments of at least 20 residues. When looking for small, recent 
duplications, we use the unitary matrix and a length of 5 or 10 residues. 

Scoring Matrices 

The methods for detecting distant relationships described above depend 
on an amino acid pair score matrix. The simplest of these, the 
unitary matrix (UM), assigns a value of +1 to identical residues and 0 to 
nonidentical ones. A slightly more complicated scoring system reflects the 
maximum number of identities in the nucleotides of the genes coding for 
the proteins. Identical amino acids obtain a score of 3; those for which two 
nucleotides could be identical, 2; one nucleotide, 1; and 0 if no nucleotides 
are ever shared in the codons for the amino acids. We refer to this as the 
genetic code matrix (GCM). In 1971, a matrix based on alternative amino 
acids (AAAM) at each position in alignments of groups of related 
sequences was derived by McLachlan. In 1968, we described a method to 
derive scoring matrices for sequences at any evolutionary distance, based 
on amino acid replacements between present-day sequences and those 
inferred as common ancestors on evolutionary trees. The residues that 
did not change and the relative exposure of the sequences to mutational 
change were taken into account. In 1978 we rederived the mutation data 
matrix (MDM) on the basis of 1572 mutations observed in families of 
closely related sequences. 
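
The two simplest matrices can be generated directly from their definitions. The Python sketch below assumes the standard genetic code and takes, for each pair of amino acids, the largest number of nucleotide positions shared by any pair of their codons; the names used are ours.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODONS = {}
for codon, amino_acid in zip(product(BASES, repeat=3), AMINO):
    CODONS.setdefault(amino_acid, []).append(codon)


def unitary_score(a, b):
    """Unitary matrix (UM): 1 for identical residues, 0 otherwise."""
    return 1 if a == b else 0


def genetic_code_score(a, b):
    """Genetic code matrix (GCM): maximum number of shared nucleotide positions."""
    return max(
        sum(x == y for x, y in zip(codon_a, codon_b))
        for codon_a in CODONS[a]
        for codon_b in CODONS[b]
    )


print(genetic_code_score("F", "L"), genetic_code_score("M", "W"),
      genetic_code_score("M", "Y"))
```

Identical amino acids score 3, phenylalanine and leucine score 2 (TTT and TTA), methionine and tryptophan score 1 (ATG and TGG), and methionine and tyrosine share no nucleotide position.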

A. D. McLachlan, J. Mol. Biol. 61, 409 (1971). 

M. O. Dayhoff and R. V. Eck, "Atlas of Protein Sequence and Structure 1967-1968," pp. 
33-41. National Biomedical Research Foundation, Silver Spring, Maryland, 1968. 

M. O. Dayhoff, R. M. Schwartz, and B. C. Orcutt, in "Atlas of Protein Sequence and 
Structure" (M. O. Dayhoff, ed.), Vol. 5, Suppl. 3, pp. 345-352. National Biomedical 
Research Foundation, Washington, D.C., 1979. 

R. M. Schwartz and M. O. Dayhoff, in "Origin of Life: Proceedings of the Second 
ISSOL Meeting, Fifth ICOL Meeting" (H. Noda, ed.), pp. 457-469. Center for Academic 
Publications Japan/Japan Scientific, Tokyo, 1978. 

These raw data were converted to a Mutation Probability Matrix. An 
element of this matrix gives the probability that the amino acid in column j 
will be replaced by the amino acid in row i after a given evolutionary 
interval, in this case 1 percent accepted mutation (PAM) in a sequence of 
average amino acid composition. There is a different mutation probability 
matrix for each evolutionary distance. The 1-PAM matrix can be 
multiplied by itself N times to yield a matrix that predicts the amino acid 
replacements to be found after N PAMs of evolutionary change in a 
sequence of average composition. To derive a scoring matrix from a 
probability matrix, each element, representing the probability of an exchange 
due to a mutation, is divided by the probability that two amino acids will 
be found by chance. The log of this ratio gives the element of the mutation 
data scoring matrix (MDM). Using programs ALIGN and RELATE and 
distantly related sequences, we compared a number of matrices including 
UM, GCM, AAAM, and mutation data matrices of different evolutionary 
distances. From these tests we concluded that all these matrices give very 
high scores with closely related sequences but that the MDM matrix for 
250 PAMs (Fig. 3) is the best matrix for detecting distantly related 
sequences. 
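
The derivation can be followed on a toy example. The Python sketch below uses an invented two-letter alphabet with hypothetical frequencies and a hypothetical 1-PAM matrix (not the observed mutation data), and it assumes a score of 10 times the base-10 logarithm of the odds; the scale factor and the function names are assumptions made for illustration.

```python
import math


def mat_mul(x, y):
    """Product of two square matrices held as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(m, n_pams):
    """Multiply the 1-PAM matrix by itself to reach n_pams PAMs."""
    result = [[float(i == j) for j in range(len(m))] for i in range(len(m))]
    for _ in range(n_pams):
        result = mat_mul(result, m)
    return result


def log_odds(m_n, freqs):
    """Score[i][j] = 10 * log10(P(j replaced by i) / frequency of i)."""
    n = len(freqs)
    return [[10.0 * math.log10(m_n[i][j] / freqs[i]) for j in range(n)]
            for i in range(n)]


# Hypothetical two-residue alphabet; columns are the original residue.
FREQS = [0.6, 0.4]
FLUX = 0.005  # chosen so that 1% of residues change per step (1 PAM)
PAM1 = [[1 - FLUX / FREQS[0], FLUX / FREQS[1]],
        [FLUX / FREQS[0], 1 - FLUX / FREQS[1]]]

pam250 = mat_power(PAM1, 250)
scores = log_odds(pam250, FREQS)
```

In this toy case each column of the 250-PAM matrix still sums to one, the score matrix is symmetric, and identities score above zero while replacements score below zero, which is the qualitative behavior expected of a log-odds matrix.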

Comments 

The methods described above or procedures very similar to them have 
been used by ourselves and others to elucidate a number of surprising 
relationships, among which are the catalytic chain of bovine cyclic 
AMP-dependent protein kinase and the src gene products of Rous avian and 
Moloney murine sarcoma viruses; α1-microglobulin, α2u-globulin, 
lactoglobulin, and plasma retinol-binding protein; antithrombin-III, 
α1-antitrypsin, and ovalbumin; epidermal growth factor and the light chain 
of coagulation factor X; the leech protease inhibitor eglin C and potato 
chymotrypsin inhibitor I, A chain; and apolipoproteins A-I, A-II, C-I, 
and C-III. The reader can, with these procedures, readily verify the 
above relationships or possibly discover new ones. 